1. Introduction

Ever since the appearance of Chomsky’s famous review (1959) of 
Skinner’s Verbal Behavior (1957), linguists have conducted an effective 
and active campaign against the empirical or conceptual adequacy of any 
learning theory whose basic concepts are those of stimulus and response 
and whose basic processes are stimulus conditioning and stimulus 
sampling. 

Because variants of stimulus-response theory had dominated much of 
experimental psychology in the two decades prior to the middle fifties, 
there is no doubt that the attack of the linguists has had a salutary 
effect in disturbing the theoretical complacency of many psychologists. 
Indeed, it has posed for all psychologists interested in systematic 
theory a number of difficult and embarrassing questions about language 
learning and language behavior in general. However, in the flush of 
their initial victories, many linguists have made extravagant claims 
and drawn sweeping, but unsupported conclusions about the inadequacy of 
stimulus-response theories to handle any central aspects of language 
behavior. I say "extravagant" and "unsupported" for this reason. The 
claims and conclusions so boldly enunciated are supported neither by 
careful mathematical argument to show that in principle a conceptual 
inadequacy is to be found in all standard stimulus-response theories, 
nor by systematic presentation of empirical evidence to show that the 
basic assumptions of these theories are empirically false. To cite two 
recent books of some importance, neither theorems nor data are to be 
found in Chomsky (1965) or Katz and Postal (1964); but one can find 
rather many useful examples of linguistic analysis, many interesting and 
insightful remarks about language behavior, and many badly formulated 
and incompletely worked out arguments about theories of language 
learning. 

The central aim of the present paper and its projected successors 
is to prove in detail that stimulus-response theory, or at least a 
mathematically precise version, can indeed give an account of the learning 
of many phrase-structure grammars. I hope that there will be no 
misunderstanding about the claims I am making. The mathematical 
definitions and theorems given here are entirely subservient to the 
conceptual task of showing that the basic ideas of stimulus-response 
theory are rich enough to generate in a natural way the learning of 
many phrase-structure grammars. I am not claiming that the mathematical 
constructions in this paper correspond in any exact way to children's 
actual learning of their first language or to the learning of a second 
language at a later stage. A number of fundamental empirical questions 
are generated by the formal developments in this paper, but none of the 
relevant investigations have yet been carried out. Some suggestions for 
experiments are mentioned below. I have been concerned to show that the 
linguists are quite mistaken in their claims that even in principle, 
apart from any questions of empirical evidence, it is not possible for 
conditioning theory to give an account of any essential parts of language 
learning. The main results in this paper, and its sequel dealing with 
general context-free languages, show that this linguistic claim is false. 

The specific constructions given here show that linguistic objections 
to the processes of stimulus conditioning and sampling as being unable 
in principle to explain any central aspects of language behavior must 
be reformulated in less sweeping generality and with greater intellectual 
precision in order to be taken seriously by psychologists. 

The mathematical formulation and proof of the main results presented 
here require the development of a certain amount of formal machinery. 
In order not to obscure the main ideas, it seems desirable to describe 
in a preliminary and intuitive fashion the character of the results. 

The central idea is quite simple--it is to show how by applying 
accepted principles of conditioning an organism may theoretically be 
taught by an appropriate reinforcement schedule to respond as a finite 
automaton. An automaton is defined as a device with a finite number of 
internal states. When it is presented with one of a finite number of 
letters from an alphabet, as a function of this letter of the alphabet 
and its current internal state, it moves to another one of its internal 
states. (A more precise mathematical formulation is given below.) In 
order to show that an organism obeying general laws of stimulus conditioning 
and sampling can be conditioned to become an automaton, it is necessary 
first of all to interpret, within the usual run of psychological concepts, 
the notion of a letter of an alphabet and the notion of an internal 
state. In my own thinking about these matters, I was first misled by 
the perhaps natural attempt to identify the internal state of the automaton 
with the state of conditioning of the organism. This idea, however, 
turned out to be clearly wrong. In the first place, the various possible 
states of conditioning of the organism correspond to various possible 
automata that the organism can be conditioned to become. Roughly 
speaking, to each state of conditioning there corresponds a different 
automaton. Probably the next most natural idea is to look at a given 
conditioning state and use the conditioning of individual stimuli to 
represent the internal states of the automaton. In very restricted cases 
this correspondence will work, but in general it will not, for reasons 
that will become clear later. The correspondence that turns out to work 
is the following: the internal states of the automaton are identified 
with the responses of the organism. There is no doubt that this surface 
behavioral identification will make many linguists concerned with deep 
structures (and other deep, abstract ideas) uneasy, but fortunately it 
is an identification already suggested in the literature of automata 
theory by E. F. Moore and others. The suggestion was originally made 
to simplify the formal characterization of automata by postulating a 
one-one relation between internal states of the machine and outputs of 
the machine. From a formal standpoint this means that the two separate 
concepts of internal state and output can be welded into the single 
concept of internal state and, for our purposes, the internal states can 
be identified with responses of the organism. The correspondence to be 
made between letters of the alphabet that the automaton will accept and 
the appropriate objects within stimulus-response theory is fairly 
obvious. The letters of the alphabet correspond in a natural way to sets 
of stimulus elements presented on a given trial to an organism. So 
again, roughly speaking, the correspondence in this case is between the 
alphabet and selected stimuli. It may seem like a happy accident, but 
the correspondences between inputs to the automata and stimuli presented 
to the organism, and between internal states of the machine and responses 
of the organism, are conceptually very natural. 

Because of the conceptual importance of the issues that have been 
raised by linguists for the future development of psychological theory, 
perhaps above all because language behavior is the most characteristically 
human aspect of our behavior patterns, it is important to be as clear as 
possible about the claims that can be made for a stimulus-response theory 
whose basic concepts seem so simple and to many so woefully inadequate 
to explain complex behavior, including language behavior. I cannot 
refrain from mentioning two examples that present very useful analogies. 
First is the reduction of all standard mathematics to the concept of 
set and the simple relation of an element being a member of a set. From 
a naive standpoint, it seems unbelievable that the complexities of higher 
mathematics can be reduced to a relation as simple as that of set 
membership. But this is indubitably the case, and we know in detail 
how the reduction can be made. This is not to suggest, for instance, 
that in thinking about a mathematical problem or even in formulating 
and verifying it explicitly, a mathematician operates simply in terms 
of endlessly complicated statements about set membership. By appropriate 
explicit definition we introduce many additional concepts, the ones 
actually used in discourse. The fact remains, however, that the reduction 
to the single relationship of set membership can be made and in fact has 
been carried out in detail. The second example, which is close to our 
present inquiry, is the status of simple machine languages for computers. 
Again, from the naive standpoint it seems incredible that modern computers 
can do the things they can in terms either of information processing 
or numerical computing when their basic language consists essentially 
just of finite sequences of 1's and 0's; but the more complex computer 
languages that have been introduced are not at all for the convenience 
of the machines but for the convenience of human users. It is perfectly 
clear how any more complex language, like ALGOL, can be reduced by a 
compiler or other device to a simple machine language. The same attitude, 
it seems to me, is appropriate toward stimulus-response theory. We 
cannot hope to deal directly in stimulus-response connections with complex 
human behavior. We can hope, as in the two cases just mentioned, to 
construct a satisfactory systematic theory in terms of which a chain of 
explicit definitions of new and ever more complex concepts can be 
introduced. It is these new and explicitly defined concepts that will 
be related directly to the more complex forms of behavior. The basic 
idea of stimulus-response association or connection is close enough in 
character to the concept of set membership or to the basic idea of 
automata to make me confident that new and better versions of 
stimulus-response theory may be expected in the future and that the scientific 
potentiality of theories stated essentially in this framework has by no 
means been exhausted. 

Before turning to specific mathematical developments, it will be 
useful to make explicit how the developments in this paper may be used to 
show that many of the common conceptions of conditioning, and particularly 
the claims that conditioning refers only to simple reflexes like those 
of salivation or eye blinking, are mistaken. The mistake is to confuse 
particular restricted applications of the fundamental theory with the 
range of the theory itself. Experiments on classical conditioning do 
indeed represent a narrow range of experiments from a broader conceptual 
standpoint. It is important to realize, however, that experiments on 
classical conditioning do not define the range and limits of conditioning 
theory itself. The main aim of the present paper is to show how any 
finite automaton, no matter how complicated, may be constructed purely 
within stimulus-response theory. But from the standpoint of automata, 
classical conditioning represents a particularly trivial example of an 
automaton. Classical conditioning may be represented by an automaton 
having a one-letter alphabet and a single internal state. This is the 
most trivial possible example of an automaton. The next simplest one 
corresponds to the structure of classical discrimination experiments. 
Here there is more than a single letter to the alphabet, but the 
transition table of the automaton depends in no way on the internal 
state of the automaton. In the case of discrimination, we may again 
think of the responses as corresponding to the internal states of the 
automaton. In this sense there is more than one internal state, contrary 
to the case of classical conditioning, but what is fundamental is that 
the transition table of the automaton does not depend on the internal 
states but only on the external stimuli presented. It is of the utmost 
importance to realize that this restriction, as in the case of classical 
conditioning experiments, is not a restriction that is in any sense 
inherent in conditioning theory itself. It merely represents 
concentration on a certain restricted class of experiments. 

Leaving the technical details for later, it is still possible to 
give a very clear example of conditioning that goes beyond the classical 
cases and yet represents perhaps the simplest non-trivial automaton. By 
non-trivial I mean: there is more than one letter in the alphabet; 
there is more than one internal state; and the transition table of the 
automaton is a function of both the external stimulus and the current 
internal state. As an example, we may take a rat being run in a maze. 
The reinforcement schedule for the rat is set up so as to make the 
rat become a two-state automaton. We will use as the external alphabet 
of the automaton a two-letter alphabet consisting of a black or a white 
card. Each choice point of the maze will consist of either a left turn 
or a right turn. At each choice point either a black card or a white 
card will be present. The following table describes both the 
reinforcement schedule and the transition table of the automaton. 

          L    R
    LB    1    0
    LW    0    1
    RB    0    1
    RW    1    0

Thus the first row shows that when the previous response has been left 
(L) and a black stimulus card (B) is presented at the choice point, with 
probability one the animal is reinforced to turn left. The second row 
indicates that when the previous response is left and a white stimulus 
card is presented at the choice point, the animal is reinforced 
100 per cent of the time to turn right, and so forth, for the other 
two possibilities. From a formal standpoint this is a simple schedule 
of reinforcement, but already the double aspect of contingency on both 
the previous response and the displayed stimulus card makes the schedule 
more complicated in many respects than the schedules of reinforcement 
that are usually run with rats. I have not been able to get a uniform 
prediction from my experimental colleagues as to whether it will be 
possible to teach rats to learn this schedule. (Most of them are 
confident pigeons can be trained to respond like non-trivial two-state 
automata.) One thing to note about this schedule is that it is 
recursive in the sense that if the animal is properly trained according 
to the schedule, the length of the maze will be of no importance. He 
will always make a response that depends only upon his previous response 
and the stimulus card present at the choice point. 
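
The schedule just described can be written out as a small program. The following is only an illustrative sketch in Python, not part of the formal development: the names `TRANSITIONS` and `run_maze` are hypothetical, and the choice of "L" as the response preceding the first choice point is an assumption made so that the run has a starting state.

```python
# Transition table of the two-state maze automaton.
# Internal state = previous response ("L" or "R"); letter = card shown ("B" or "W").
TRANSITIONS = {
    ("L", "B"): "L",
    ("L", "W"): "R",
    ("R", "B"): "R",
    ("R", "W"): "L",
}


def run_maze(cards, first_response="L"):
    """Return the reinforced turn at each choice point for a sequence of cards.

    `first_response` is an assumed value for the response that precedes the
    first choice point; the text does not specify it.
    """
    response = first_response
    turns = []
    for card in cards:
        # The next response depends only on the previous response and the card.
        response = TRANSITIONS[(response, card)]
        turns.append(response)
    return turns


if __name__ == "__main__":
    print(run_maze("BWWB"))
```

Read this way, a black card leaves the previous response unchanged and a white card reverses it, and the same four-entry table serves for a maze of any length.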

There is no pretense that this simple two-state automaton is in 
any sense adequate to serious language learning. I am not proposing, for 
example, that there is much chance of teaching even a simple one-sided 
linear grammar to rats. I am proposing to psychologists, however, that 
already automata of a small number of states present immediate experimental 
challenges in terms of what can be done with animals of each species. 
For example, what is the most complicated automaton a monkey may be 
trained to imitate? In this case, there seems some possibility of 
approaching at least reasonably complex one-sided linear grammars (using 
the theorem that any one-sided linear grammar is definable by a 
finite-state automaton). In the case of the lower species, it will be necessary 
to exploit to the fullest the kind of stimuli to which the organisms are 
most sensitive and responsive in order to maximize the complexity of the 
automata they can imitate. 

If a generally agreed upon definition of complexity for finite 
automata can be reached, it will be possible to use this measure to gauge 
the relative level of organizational complexity that can be achieved by 
a given species, at least in terms of an external schedule of conditioning 
and reinforcement. I do want to emphasize that the measures appropriate 
to experiments with animals are almost totally different from the measures 
that have been discussed in the recent literature of automata as 
complexity measures for computations. What is needed in the case of 
animals is the simple and orderly arrangement on a complexity scale of 
automata that have a relatively small number of states and that accept 
a relatively small alphabet of stimuli. The number of distinct training 
conditions is not a bad measure and can be used as a first approximation. 
Thus in the case of classical conditioning, this number is one. In the 
case of discrimination between black and white stimuli, the number is 
two. In the case of the two-state automaton described for the maze 
experiment, this number is four, but there are some problems with this 
measure. It is not clear that we would regard an organism that masters 
a discrimination experiment consisting of six different responses to six 
different discriminating stimuli as more complex than this two-state 
automaton. Consequently, what I have said here about 
complexity is pre-systematic. I do think the development of an 
appropriate scale of complexity can be of considerable theoretical interest, 
particularly in cross-species comparison of intellectual power. 
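
One possible reading of "the number of distinct training conditions" can be made concrete as follows. This is a sketch under an assumption of mine, not a definition given in the text: when the transition table ignores the internal state, only the letters are counted, and otherwise every pair of state and letter is counted. The function name `training_conditions` is hypothetical.

```python
def training_conditions(transitions):
    """Count distinct training conditions for a table {(state, letter): next_state}.

    Assumed reading: if the next state never depends on the current state,
    the conditions are just the letters; otherwise they are all
    (state, letter) pairs. The table must be defined for every pair.
    """
    states = {state for state, _ in transitions}
    letters = {letter for _, letter in transitions}
    state_independent = all(
        len({transitions[(state, letter)] for state in states}) == 1
        for letter in letters
    )
    if state_independent:
        return len(letters)
    return len(states) * len(letters)


# Classical conditioning: one letter, one internal state.
classical = {("r", "c"): "r"}

# Black-white discrimination: the response depends on the card only.
discrimination = {
    ("r1", "B"): "r1", ("r2", "B"): "r1",
    ("r1", "W"): "r2", ("r2", "W"): "r2",
}

# The two-state maze automaton: the response depends on card and previous response.
maze = {
    ("L", "B"): "L", ("L", "W"): "R",
    ("R", "B"): "R", ("R", "W"): "L",
}

if __name__ == "__main__":
    for name, table in [("classical", classical),
                        ("discrimination", discrimination),
                        ("maze", maze)]:
        print(name, training_conditions(table))
```

Under this reading the three cases come out as one, two, and four, in agreement with the numbers in the text; the six-stimulus discrimination experiment would come out as six, which is the difficulty with the measure noted above.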

The remainder of this paper is devoted to the technical development 
of the general ideas already discussed. Section 2 is concerned with 
standard notions of finite and probabilistic automata. Readers already 
familiar with this literature should skip this section and go on to the 
treatment of stimulus-response theory in Section 3. It has been 
necessary to give a rigorous axiomatization of stimulus-response theory 
in order to formulate the representation theorem for finite automata in 
mathematically precise form. However, the underlying ideas of 
stimulus-response theory as formulated in Section 3 will be familiar to all 
experimental psychologists. In Section 4 the most important result of 
the paper is proved, namely, that any finite automaton can be represented 
at asymptote by an appropriate model of stimulus-response theory. In 
Section 5 some extensions of these results to probabilistic automata 
are sketched, and an example from arithmetic is worked out in detail. 

The relationship between stimulus-response theory and grammars is 
established in Section 5 by known theorems relating automata to grammars. 
The results in the present paper are certainly restricted regarding 
the full generality of context-free languages. Weakening these 
restrictions will be the focus of a subsequent paper. 

Some results on tote hierarchies in the sense of Miller, 
Galanter and Pribram (1960) are also given in Section 4. The 
representation of tote hierarchies by stimulus-response models follows 
directly from the main theorem of that section. 

2. Automata 

The account of automata given here is formally self-contained, 
but not really self-explanatory in the deeper sense of discussing and 
interpreting in adequate detail the systematic definitions and theorems. 
I have followed closely the development in the well-known article of 
Rabin and Scott (1959), and for probabilistic automata, the article of 
Rabin (1963). 

Definition 1. A structure $\mathfrak{A} = \langle A, \Sigma, M, s_0, F \rangle$ is a finite 
(deterministic) automaton if and only if 

(i) $A$ is a finite, non-empty set (the set of states of $\mathfrak{A}$), 

(ii) $\Sigma$ is a finite, non-empty set (the alphabet), 

(iii) $M$ is a function from the Cartesian product $A \times \Sigma$ to $A$ 
($M$ defines the transition table of $\mathfrak{A}$), 

(iv) $s_0$ is in $A$ ($s_0$ is the initial state of $\mathfrak{A}$), 

(v) $F$ is a subset of $A$ ($F$ is the set of final states of $\mathfrak{A}$). 

In view of the generality of this definition it is apparent that there 
are a great variety of automata, but as we shall see, this generality 
is easily matched by the generality of the models of stimulus-response 
theory. 

In notation now nearly standardized, $\Sigma^*$ is the set of finite 
sequences of elements of $\Sigma$, including the empty sequence $\Lambda$. The 
elements of $\Sigma^*$ are ordinarily called tapes. If $\sigma_1, \ldots, \sigma_k$ are in 
$\Sigma$, then $x = \sigma_1 \cdots \sigma_k$ is in $\Sigma^*$. (As we shall see, these tapes 
correspond in a natural way to finite sequences of sets of stimulus 
elements.) The function $M$ can be extended to a function from $A \times \Sigma^*$ 
to $A$ by the following recursive definition for $s$ in $A$, $x$ in $\Sigma^*$, 
and $\sigma$ in $\Sigma$: 

$M(s, \Lambda) = s$, 

$M(s, x\sigma) = M(M(s, x), \sigma)$. 

Definition 2. A tape $x$ of $\Sigma^*$ is accepted by $\mathfrak{A}$ if and only 
if $M(s_0, x)$ is in $F$. A tape $x$ that is accepted by $\mathfrak{A}$ is a 
sentence of $\mathfrak{A}$. 

We shall also refer to tapes as strings of the alphabet $\Sigma$. 

Definition 3. The language $L$ generated by $\mathfrak{A}$ is the set of 
all sentences of $\mathfrak{A}$, i.e., the set of all tapes accepted by $\mathfrak{A}$. 
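
Definitions 1 to 3 translate almost line for line into a program. The sketch below is illustrative only; the class name `FiniteAutomaton` and its field names are hypothetical, and the example automaton reuses the maze table from the Introduction with an assumed initial state "L" and an assumed set of final states {"L"}, neither of which is fixed by the text.

```python
from dataclasses import dataclass


@dataclass
class FiniteAutomaton:
    states: frozenset      # A
    alphabet: frozenset    # Sigma
    transition: dict       # M, given as {(state, letter): next_state}
    initial: str           # s_0
    final: frozenset       # F

    def extended(self, state, tape):
        """Extended M: M(s, Lambda) = s and M(s, x sigma) = M(M(s, x), sigma)."""
        for letter in tape:
            state = self.transition[(state, letter)]
        return state

    def accepts(self, tape):
        """Definition 2: a tape is accepted iff M(s_0, tape) is in F."""
        return self.extended(self.initial, tape) in self.final


# Hypothetical example: the maze automaton, with "L" assumed initial and final.
maze = FiniteAutomaton(
    states=frozenset({"L", "R"}),
    alphabet=frozenset({"B", "W"}),
    transition={
        ("L", "B"): "L", ("L", "W"): "R",
        ("R", "B"): "R", ("R", "W"): "L",
    },
    initial="L",
    final=frozenset({"L"}),
)

if __name__ == "__main__":
    for tape in ["", "W", "BWWB"]:
        print(repr(tape), maze.accepts(tape))
```

With these assumed choices the language of Definition 3 is the set of card sequences containing an even number of white cards. The loop in `extended` unrolls the recursive definition from left to right, which gives the same value as the recursion on the last letter.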

Regular languages are sometimes defined just as those languages generated 
by some finite automaton. An independent, set-theoretical 
characterization is also possible. The basic result follows from Kleene’s (1956) 
fundamental analysis of the kind of events definable by McCulloch-Pitts 
nets. Several equivalent formulations are given in the article by Rabin 
and Scott. From a linguistic standpoint probably the most useful 
characterization is that to be found in Chomsky (1963, pp. 368-371). 
Regular languages are generated by one-sided linear grammars. Such 
grammars have a finite number of rewrite rules, which in the case of 
right-linear rules are of the form 

$A \to xB$. 

Whichever of several equivalent formulations is used, the fundamental 
theorem, originally due to Kleene, but closely related to the theorem of 
Myhill given by Rabin and Scott, is this. 
Theorem on Regular Languages. Any regular language is generated by 
some finite automaton, and every finite automaton generates a regular 
language. 

For the main theorem of this article, we need the concepts of 
isomorphism and equivalence of finite automata. The definition of isomorphism 
is just the natural set-theoretical one for structures like automata. 

Definition 4. Let $\mathfrak{A} = \langle A, \Sigma, M, s_0, F \rangle$ and 
$\mathfrak{A}' = \langle A', \Sigma', M', s_0', F' \rangle$ be finite automata. Then $\mathfrak{A}$ and $\mathfrak{A}'$ are 
isomorphic if and only if there exists a function $f$ such that 

(i) $f$ is one-one, 

(ii) the domain of $f$ is $A \cup \Sigma$ and the range of $f$ is $A' \cup \Sigma'$, 

(iii) for every $a$ in $A \cup \Sigma$, 
$a \in A$ if and only if $f(a) \in A'$, 

(iv) for every $s$ in $A$ and $\sigma$ in $\Sigma$, 
$f(M(s, \sigma)) = M'(f(s), f(\sigma))$, 

(v) $f(s_0) = s_0'$, 

(vi) for every $s$ in $A$, 
$s \in F$ if and only if $f(s) \in F'$. 

It is apparent that conditions (i)-(iii) of the definition imply that 
for every $a$ in $A \cup \Sigma$, 

$a \in \Sigma$ if and only if $f(a) \in \Sigma'$, 

and consequently, this condition on $\Sigma$ need not be stated. From the 
standpoint of the general algebraic or set-theoretical concept of 
isomorphism, it would have been more natural to define an automaton in 
terms of a basic set $B = A \cup \Sigma$, and then require that $A$ and $\Sigma$ are 
both subsets of $B$. Rabin and Scott avoid the problem by not making $\Sigma$ 
a part of the automaton. They define the concept of an automaton 
$\mathfrak{A} = \langle A, M, s_0, F \rangle$ with respect to an alphabet $\Sigma$, but for the 
purposes of this paper it is also desirable to include the alphabet $\Sigma$ 
in the definition of $\mathfrak{A}$ in order to make explicit the natural place 
of the alphabet in the stimulus-response models, and above all, to 
provide a simple setup for going from one alphabet $\Sigma$ to another $\Sigma'$. 
In any case, exactly how these matters are handled is not of central 
importance here. 
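
The six conditions of Definition 4 can be checked mechanically for a proposed map $f$. The following Python sketch is offered only as an illustration; the function name `is_isomorphism` and the representation of an automaton as a tuple `(A, Sigma, M, s0, F)` are my own choices, and the code assumes that states and letters are distinct objects and that both transition tables are total.

```python
def is_isomorphism(f, automaton, automaton_prime):
    """Check conditions (i)-(vi) of Definition 4 for a candidate map f (a dict)."""
    A, Sigma, M, s0, F = automaton
    A2, Sigma2, M2, s02, F2 = automaton_prime
    domain = A | Sigma
    return (
        set(f) == domain                                    # (ii) domain of f
        and len(set(f.values())) == len(f)                  # (i) f is one-one
        and set(f.values()) == A2 | Sigma2                  # (ii) range of f
        and all((a in A) == (f[a] in A2) for a in domain)   # (iii)
        and all(f[M[(s, x)]] == M2[(f[s], f[x])]            # (iv)
                for s in A for x in Sigma)
        and f[s0] == s02                                    # (v)
        and all((s in F) == (f[s] in F2) for s in A)        # (vi)
    )


# Hypothetical example: the maze automaton and a relabeled copy of it.
maze = (
    {"L", "R"}, {"B", "W"},
    {("L", "B"): "L", ("L", "W"): "R", ("R", "B"): "R", ("R", "W"): "L"},
    "L", {"L"},
)
parity = (
    {"even", "odd"}, {"0", "1"},
    {("even", "0"): "even", ("even", "1"): "odd",
     ("odd", "0"): "odd", ("odd", "1"): "even"},
    "even", {"even"},
)

if __name__ == "__main__":
    f = {"L": "even", "R": "odd", "B": "0", "W": "1"}
    print(is_isomorphism(f, maze, parity))
```

Because the conditions are tested in order and the conjunction stops at the first failure, condition (iv) is evaluated only for maps that already send states to states and letters to letters.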

Definition 5. Two automata are equivalent if and only if they 
accept exactly the same set of tapes. 

This is the standard definition of equivalence in the literature. As 
it stands, it means that the definition of equivalence is neither 
stronger nor weaker than the definition of isomorphism, because, on 
the one hand, equivalent automata are clearly not necessarily isomorphic, 
and, on the other hand, isomorphic automata with different alphabets 
are not equivalent. It would seem natural to weaken the notion of 
equivalence to include two automata that generate distinct but 
isomorphic languages, or sets of tapes, but this point will bear on matters 
here only tangentially. 
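
For two automata over the same alphabet, equivalence in the sense of Definition 5 can be decided by running both machines in parallel. The sketch below uses a standard search over pairs of states, which is a technique I am supplying and not one described in the text; the function name `equivalent` and the tuple representation `(A, Sigma, M, s0, F)` are hypothetical, and the restriction to a shared alphabet is an assumption of the sketch.

```python
from collections import deque


def equivalent(automaton1, automaton2):
    """Decide whether two automata over one alphabet accept the same tapes."""
    A1, Sigma1, M1, s1, F1 = automaton1
    A2, Sigma2, M2, s2, F2 = automaton2
    if Sigma1 != Sigma2:
        raise ValueError("this sketch assumes a shared alphabet")
    seen = {(s1, s2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        # Some tape leads to (p, q); the automata must agree on accepting it.
        if (p in F1) != (q in F2):
            return False
        for letter in Sigma1:
            pair = (M1[(p, letter)], M2[(q, letter)])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True


# Hypothetical example: two automata accepting tapes of even length over {"a"}.
mod2 = ({0, 1}, {"a"}, {(0, "a"): 1, (1, "a"): 0}, 0, {0})
mod4 = (
    {0, 1, 2, 3}, {"a"},
    {(0, "a"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 0},
    0, {0, 2},
)

if __name__ == "__main__":
    print(equivalent(mod2, mod4))
```

The two example automata are equivalent but cannot be isomorphic, since they have different numbers of states; this is the first half of the remark that equivalence is neither stronger nor weaker than isomorphism.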

A finite automaton is connected if for every state $s$ there is a 
tape $x$ such that $M(s_0, x) = s$. It is easy to show that every automaton 
is equivalent to a connected automaton, and the representation theorem 
of Section 4 is restricted to connected automata. It is apparent that 
from a functional standpoint, states that cannot be reached by any 
tape are of no interest, and consequently, restriction to connected 
automata does not represent any real loss of generality. The difficulty 
of representing automata with unconnected states by stimulus-response 
models is that we have no way to condition the organism with respect to 
these states, at least in terms of the approach developed here. 
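
The passage from an arbitrary automaton to an equivalent connected one amounts to discarding the states that no tape reaches. A minimal sketch, assuming the hypothetical tuple representation `(A, Sigma, M, s0, F)` and hypothetical function names:

```python
def reachable_states(Sigma, M, s0):
    """Return the set of states s for which some tape x gives M(s0, x) = s."""
    reached = {s0}            # the empty tape reaches s0 itself
    frontier = [s0]
    while frontier:
        state = frontier.pop()
        for letter in Sigma:
            target = M[(state, letter)]
            if target not in reached:
                reached.add(target)
                frontier.append(target)
    return reached


def connected_part(automaton):
    """Restrict an automaton to its reachable states."""
    A, Sigma, M, s0, F = automaton
    keep = reachable_states(Sigma, M, s0)
    M_restricted = {(s, a): t for (s, a), t in M.items() if s in keep}
    return keep, Sigma, M_restricted, s0, F & keep


# Hypothetical example: state "c" cannot be reached from the initial state "a".
example = (
    {"a", "b", "c"}, {"0"},
    {("a", "0"): "b", ("b", "0"): "a", ("c", "0"): "a"},
    "a", {"b", "c"},
)

if __name__ == "__main__":
    print(connected_part(example))
```

Since an unreachable state is never visited while a tape is read, removing it changes neither the value of $M(s_0, x)$ for any tape nor the set of accepted tapes.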

It is also straightforward to establish a representation for 
probabilistic automata within stimulus-response theory, and, as will 
become apparent in Section 5, there are some interesting differences in 
the way we may represent deterministic and probabilistic automata within 
stimulus-response theory. 



Definition 6. A structure $\mathfrak{A} = \langle A, \Sigma, p, s_0, F \rangle$ is a (finite) 
probabilistic automaton if and only if 

(i) $A$ is a finite, non-empty set, 

(ii) $\Sigma$ is a finite, non-empty set, 

(iii) $p$ is a function on $A \times \Sigma$ such that for each $s$ in $A$ and 
$\sigma$ in $\Sigma$, $p_{s,\sigma}$ is a probability density over $A$, i.e., 

(a) for each $s'$ in $A$, $p_{s,\sigma}(s') \geq 0$, 

(b) $\sum_{s' \in A} p_{s,\sigma}(s') = 1$, 

(iv) $s_0$ is in $A$, 

(v) $F$ is a subset of $A$. 
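
Condition (iii) is the only new requirement, and it is easy to check and to simulate. The following Python sketch is illustrative; the function names and the representation of $p$ as a dictionary of dictionaries are hypothetical, and the particular probabilities in the example are invented for illustration and are not taken from any experiment.

```python
import math
import random


def is_transition_density(A, Sigma, p):
    """Check (iii): every p[(s, sigma)] is a probability density over A."""
    for s in A:
        for sigma in Sigma:
            density = p[(s, sigma)]
            if set(density) - set(A):
                return False          # mass placed on something outside A
            if any(density.get(t, 0.0) < 0 for t in A):
                return False          # violates (a)
            if not math.isclose(sum(density.get(t, 0.0) for t in A), 1.0):
                return False          # violates (b)
    return True


def sample_next_state(p, s, sigma, rng=random):
    """Draw the next state from the density attached to (s, sigma)."""
    density = p[(s, sigma)]
    states = list(density)
    return rng.choices(states, weights=[density[t] for t in states])[0]


# Hypothetical example: a noisy version of the maze automaton.
A = {"L", "R"}
Sigma = {"B", "W"}
noisy_maze = {
    ("L", "B"): {"L": 0.9, "R": 0.1},
    ("L", "W"): {"L": 0.1, "R": 0.9},
    ("R", "B"): {"L": 0.1, "R": 0.9},
    ("R", "W"): {"L": 0.9, "R": 0.1},
}

if __name__ == "__main__":
    print(is_transition_density(A, Sigma, noisy_maze))
    print(sample_next_state(noisy_maze, "L", "W"))
```

A deterministic automaton in the sense of Definition 1 is recovered as the special case in which each density puts probability one on a single state.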



The only change in generalizing from Definition 1 to Definition 6 is 
found in (iii), although it is natural to replace (iv) by an initial 
probability density. It is apparent how Definition 4 must be modified to 
characterize the isomorphism of probabilistic automata, and so the 
explicit definition will not be given. 

3. Stimulus-response Theory 

The formalization of stimulus-response theory given here follows 
closely the treatment in Estes and Suppes (1959) and Suppes and 
Atkinson (1960). Some minor changes have been made to facilitate the 
treatment of finite automata, but it is to be strongly emphasized that 
none of the basic ideas or assumptions has required modification. 

The theory is based on six primitive concepts, each of which has 
a direct psychological interpretation. The first one is the set S of 
stimuli, which we shall assume is not empty, but which we will not 
restrict to being either finite or infinite on all occasions. The second 
primitive concept is the set R of responses and the third primitive 
concept the set E of possible reinforcements. As in the case of the 
set of stimuli, we need not assume that either R or E is finite, but 
in the present applications to the theory of finite automata we shall 
make this restrictive assumption. (For a proper treatment of phonology 
it will clearly be necessary to make R, and probably E as well, 
infinite with at the very least a strong topological if not metric 
structure.) 

The fourth primitive concept is that of a measure $\mu$ on the set of 
stimuli. In case the set S is finite this measure is often the number 
of elements in S. For the general theory we shall assume that the 
measure of S itself is always finite, i.e., $\mu(S) < \infty$. 

The fifth primitive concept is the sample space X. Each element 
x of the sample space represents a possible experiment, that is, an 
infinite sequence of trials. In the present theory, each trial may be 
described by an ordered quintuple $\langle C, T, s, r, e \rangle$, where C is the 
conditioning function, T is the subset of stimuli presented to the 
organism on the given trial, s is the sampled subset of T, r is the 
response made on the trial, and e is the reinforcement occurring on 
that trial. It is not possible to make all the comments here that are 
required for a full interpretation and understanding of the theory. For 
those wanting a more detailed description, the two references already 
given will prove useful. A very comprehensive set of papers on 
stimulus-sampling theory has been put together in the collection edited by 
Neimark and Estes (1967). The present version of stimulus-response 
theory should in many respects be called stimulus-sampling theory, 
but I have held to the more general stimulus-response terminology to 
emphasize the juxtaposition of the general ideas of behavioral psychology 
on the one hand and linguistic theory on the other. In addition, in the 
theoretical applications to be made here the specific sampling aspects of 
stimulus-response theory are not as central as in the analysis of 
experimental data. 

Because of the importance to be attached later to the set T of 
stimuli presented on each trial, its interpretation in classical learning 
theory should be explicitly mentioned. In the case of simple learning, 
for example, in classical conditioning, the set T is the same on all 
trials and we would ordinarily identify the sets T and S. In the 
case of discrimination learning, the set T varies from trial to trial, 
and the application we are making to automata theory falls generally 
under the discrimination case. The conditioning function C is defined 
over the set R of responses and $C_r$ is the subset of S conditioned 
or connected to response r on the given trial. How the conditioning 
function changes from trial to trial is made clear by the axioms. 



From the quintuple description of a given trial it is clear that 
certain assumptions about the behavior that occurs on a trial have 
already been made. In particular it is apparent that we are assuming 
that only one sample of stimuli is drawn on a given trial, that exactly 
one response occurs on a trial and that exactly one reinforcement 
occurs on a trial. These assumptions have been built into the 
set-theoretical description of the sample space X and will not be an 
explicit part of our axioms. 
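
The way these assumptions are built into the description of a trial can be seen by writing the quintuple as a record with exactly one slot for each component. The Python sketch below is only an illustration; the names `Trial` and `is_valid_trial`, and the labels used for stimuli, responses, and reinforcements in the example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    conditioning: dict       # C: response r -> subset C_r of S conditioned to r
    presented: frozenset     # T: stimuli presented on the trial
    sampled: frozenset       # s: the one sample drawn from T
    response: str            # r: the one response made
    reinforcement: str       # e: the one reinforcement occurring


def is_valid_trial(trial, stimuli, responses, reinforcements):
    """Check that a quintuple respects the set-theoretical description."""
    return (
        trial.presented <= stimuli
        and trial.sampled <= trial.presented
        and trial.response in responses
        and trial.reinforcement in reinforcements
        and set(trial.conditioning) == set(responses)     # C is defined over R
        and all(c <= stimuli for c in trial.conditioning.values())
    )


# Hypothetical example with three stimuli and two responses.
S = frozenset({"s1", "s2", "s3"})
R = {"r1", "r2"}
E = {"e0", "e1", "e2"}   # assumed labels: e0 for no reinforcement

trial = Trial(
    conditioning={"r1": frozenset({"s1"}), "r2": frozenset({"s2", "s3"})},
    presented=frozenset({"s1", "s2"}),
    sampled=frozenset({"s1"}),
    response="r1",
    reinforcement="e1",
)

if __name__ == "__main__":
    print(is_valid_trial(trial, S, R, E))
```

Because the record has a single field for the sample, the response, and the reinforcement, a trial with two responses or no reinforcement event simply cannot be expressed, which is the sense in which these assumptions need no separate axiom.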

Lying behind the formality of the ordered quintuples representing 
each trial is the intuitively conceived temporal ordering of events on 
any trial, which may be represented by the following diagram: 

    state of conditioning at beginning of trial          $C_n$
      → presentation of stimuli                          $T_n$
      → sampling of stimuli                              $s_n$
      → response                                         $r_n$
      → reinforcement                                    $e_n$
      → state of conditioning at beginning of new trial  $C_{n+1}$

The sixth and final primitive concept is the probability measure P on 
the appropriate Borel field of cylinder sets of X. The exact description 
of this Borel field is rather complicated when the set of stimuli is not 
finite, but the construction is standard, and we shall assume the reader 
can fill in details familiar from general probability theory. It is to 
be emphasized that all probabilities must be defined in terms of the 
measure P. 



We also need certain notation to take us back and forth between 
elements or subsets of the sets of stimuli, responses, and reinforcements 
to events of the sample space X. First, $r_n$ is the event of response 
r on trial n, that is, the set of all possible experimental realizations 
or elements of X having r as a response on the nth trial. Similarly, 
$e_{r,n}$ is the event of response r’s being reinforced on trial n. The 
event $e_{0,n}$ is the event of no reinforcement on trial n. In like 
fashion, $C_n$ is the event of conditioning function C occurring on 
trial n, $T_n$ is the event of presentation set T occurring on trial n, 
and so forth. Additional notation that does not follow these conventions 
will be explicitly noted. 

We also need a notation for sets defined by events occurring up to
a given trial. Reference to such sets is required in expressing that
central aspects of stimulus conditioning and sampling are independent
of the pattern of past events. If I say that $Y_n$ is an n-cylinder set,
I mean that the definition of $Y_n$ does not depend on any event occurring
after trial n. However, an even finer breakdown is required that takes
account of the postulated sequence $C_n \to T_n \to s_n \to r_n \to e_n$
on a given trial, so in saying that $Y_n$ is a $C_n$-cylinder set what is
meant is that its definition does not depend on any event occurring after
$C_n$ on trial n, i.e., its definition could depend on $T_{n-1}$ or $C_n$,
for example, but not on $T_n$ or $s_n$. As an abbreviated notation, I shall
write $Y(C_n)$ for this set and similarly for other cylinder sets. The
notation $Y_n$ without additional qualification shall always refer to
an n-cylinder set.

Also, to avoid an overly cumbersome notation, event notation of
the sort already indicated will be used, e.g., $e_{r,n}$ for reinforcement
of response r on trial n, but also the notation $\sigma \in C_{r,n}$ for
the event of stimulus $\sigma$'s being conditioned to response r on
trial n.

To simplify the formal statement of the axioms it shall be
assumed without repeated explicit statement that any given events on
which probabilities are conditioned have positive probability. Thus,
for example, the tacit hypothesis of Axiom S2 is that $P(T_m) > 0$ and
$P(T_n) > 0$.

The axioms naturally fall into three classes. Stimuli must be
sampled in order to be conditioned, and they must be conditioned in order
for systematic response patterns to develop. Thus, there are naturally
three kinds of axioms: sampling axioms, conditioning axioms, and
response axioms. A verbal formulation of each axiom is given together
with its formal statement. From the standpoint of formulations of the
theory already in the literature, perhaps the most unusual feature of the
present axioms is not to require that the set S of stimuli be finite.
It should also be emphasized that for any one specific kind of detailed
application additional specializing assumptions are needed. Some
indication of these will be given in the particular application to automata
theory, but it would take us too far afield to explore these specializing
assumptions in any detail and with any faithfulness to the range of
assumptions needed for different experimental applications.

Definition 7. A structure $\mathcal{S} = \langle S, R, E, \mu, X, P \rangle$
is a stimulus-response model if and only if the following axioms are
satisfied:

Sampling Axioms

S1. $P(\mu(s_n) > 0) = 1$.

(On every trial a set of stimuli of positive measure is sampled
with probability 1.)

S2. $P(s_m \mid T_m) = P(s_n \mid T_n)$.

(If the same presentation set occurs on two different trials,
then the probability of a given sample is independent of the trial
number.)

S3. If $s \cup s' \subseteq T$ and $\mu(s) = \mu(s')$,
then $P(s_n \mid T_n) = P(s'_n \mid T_n)$.

(Samples of equal measure that are subsets of the presentation
set have an equal probability of being sampled on a given trial.)

S4. $P(s_n \mid T_n, Y(C_n)) = P(s_n \mid T_n)$.

(The probability of a particular sample on trial n, given the
presentation set of stimuli, is independent of any preceding pattern
$Y(C_n)$ of events.)

Conditioning Axioms

C1. If $r, r' \in R$, $r \neq r'$ and $C_r \cap C_{r'} \neq \emptyset$,
then $P(C_n) = 0$.

(On every trial with probability 1 each stimulus element is
conditioned to at most one response.)

C2. There exists a $c > 0$ such that for every $\sigma$, C, r, n, s,
$e_r$, and $Y_n$

    $P(\sigma \in C_{r,n+1} \mid \sigma \notin C_{r,n}, \sigma \in s_n, e_{r,n}, Y_n) = c$.

(The probability is c of any sampled stimulus element's
becoming conditioned to the reinforced response if it is not already so
conditioned, and this probability is independent of the particular
response, trial number, or any preceding pattern $Y_n$ of events.)

C3. $P(C_{n+1} \mid C_n, e_{0,n}) = 1$.

(With probability 1, the conditioning of all stimulus elements
remains the same if no response is reinforced.)

C4. $P(\sigma \in C_{r,n+1} \mid \sigma \in C_{r,n}, \sigma \notin s_n, Y_n) = 1$.

(With probability 1, the conditioning of unsampled stimuli
does not change.)

Response Axioms

R1. If $\bigcup_{r \in R} C_r \cap s \neq \emptyset$ then

    $P(r_n \mid C_n, s_n, Y(s_n)) = \mu(s \cap C_r) / \mu(s)$.

(If at least one sampled stimulus is conditioned to some
response, then the probability of any response is the ratio of the measure
of sampled stimuli conditioned to this response to the measure of all
the sampled stimuli, and this probability is independent of any preceding
pattern $Y(s_n)$ of events.)

R2. If $\bigcup_{r \in R} C_r \cap s = \emptyset$ then there is a number
$\rho_r$ such that

    $P(r_n \mid C_n, s_n, Y(s_n)) = \rho_r$.

(If no sampled stimulus is conditioned to any response, then
the probability of any response r is a constant guessing probability
$\rho_r$ that is independent of n and any preceding pattern $Y(s_n)$ of
events.)
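
The two response axioms can be illustrated for a finite set of stimuli. The following is a minimal sketch, assuming the measure is cardinality and that the conditioned sets are disjoint, as Axiom C1 requires; the function name, the stimulus labels, and the guessing values are hypothetical and chosen only for illustration. The denominator follows the verbal statement of R1, the measure of all the sampled stimuli.

```python
from fractions import Fraction


def response_probabilities(sample, conditioning, guessing):
    """Response distribution on one trial under Axioms R1 and R2.

    sample       -- set of sampled stimulus elements (the set s)
    conditioning -- dict mapping each response r to the set C_r
    guessing     -- dict mapping each response r to a guessing probability
    The measure is assumed to be cardinality (an assumption of this sketch).
    """
    # Measure of the sampled stimuli conditioned to each response.
    weight = {r: len(sample & c_r) for r, c_r in conditioning.items()}
    if sum(weight.values()) == 0:
        # R2: no sampled stimulus is conditioned, so the response is a guess.
        return dict(guessing)
    # R1: conditioned sampled measure for r over the measure of the sample.
    return {r: Fraction(w, len(sample)) for r, w in weight.items()}


# Hypothetical conditioning state: a and b conditioned to r1, c to r2.
conditioning = {"r1": {"a", "b"}, "r2": {"c"}}
guessing = {"r1": Fraction(1, 2), "r2": Fraction(1, 2)}

print(response_probabilities({"a", "b", "c"}, conditioning, guessing))
print(response_probabilities({"d"}, conditioning, guessing))
```

In the first call every sampled element is conditioned, so R1 applies; in the second the only sampled element is unconditioned, so the guessing probabilities of R2 are returned.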

A general discussion of these axioms and their implications for a
wide range of psychological experiments may be found in the references
already cited. The techniques of analysis used in the next section of
this paper are extensively exploited and applied to a number of experiments
in Suppes and Atkinson (1960).

4. Representation of Finite Automata

A useful beginning for the analysis of how we may represent finite
automata by stimulus-response models is to see what is wrong with the
most direct approach possible. The difficulties that turn up may be
illustrated by the simple example of a two-letter alphabet (i.e., two
stimuli $\sigma_1$ and $\sigma_2$, as well as the "start-up" stimulus
$\sigma_0$) and a two-state automaton (i.e., two responses $r_1$ and
$r_2$). Consideration of this example, already mentioned in the
introductory section, will be useful for several points of later
discussion.

By virtue of Axiom S1, the single presented stimulus must be
sampled on each trial, and we assume that for every n,

    $0 < P(\sigma_0 \in s_n),\ P(\sigma_1 \in s_n),\ P(\sigma_2 \in s_n) < 1$.

Suppose, further, the transition table of the machine is:

                       $r_1$    $r_2$
    $r_1\sigma_1$        1        0
    $r_1\sigma_2$        0        1
    $r_2\sigma_1$        0        1
    $r_2\sigma_2$        1        0

which requires knowledge of both $r_i$ and $\sigma_j$ to predict what
response should be next. The natural and obvious reinforcement schedule
for imitating this machine is:

    $P(e_{1,n} \mid \sigma_{1,n}, r_{1,n-1}) = 1$,
    $P(e_{2,n} \mid \sigma_{1,n}, r_{2,n-1}) = 1$,
    $P(e_{2,n} \mid \sigma_{2,n}, r_{1,n-1}) = 1$,
    $P(e_{1,n} \mid \sigma_{2,n}, r_{2,n-1}) = 1$,

where $\sigma_{i,n}$ is the event of stimulus $\sigma_i$'s being sampled
on trial n.

But for this reinforcement schedule the conditioning of each of the two
stimuli continues to fluctuate from trial to trial, as may be illustrated
by the following sequence. For simplification and without loss of
generality, we may assume that the conditioning parameter c is 1, and
we need indicate no sampling, because as already mentioned, the single
stimulus element in each presentation set will be sampled with probability
1. We may represent the states of conditioning (granted that each stimulus
is conditioned to either $r_1$ or $r_2$) by subsets of
$S = \{\sigma_0, \sigma_1, \sigma_2\}$. Thus, if $\{\sigma_1, \sigma_2\}$
represents the conditioning function, this means both elements $\sigma_1$
and $\sigma_2$ are conditioned to $r_1$; $\{\sigma_1\}$ means that only
$\sigma_1$ is conditioned to $r_1$, and so forth. Consider then the
following sequence from trial n to n + 2:

    [Display: a sequence of three quadruples, one per trial, each giving
    the conditioning state, the sampled stimulus, the response, and the
    reinforcement. Only the last term,
    $\langle \emptyset, \sigma_2 \in s, r_2, e_1 \rangle$, is legible in
    this copy.]



The response on trial n + 1 satisfies the machine table, but already
on n + 2 it does not, for $r_{2,n+1}\sigma_{2,n+2}$ should be followed by
$r_{1,n+2}$. It is easy to show that this difficulty is fundamental and
arises for any of the four possible conditioning states. (In working out
these difficulties explicitly, the reader should assume that each stimulus
is conditioned to either $r_1$ or $r_2$, which will be true for n much
larger than 1 and c = 1.)
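
The difficulty can be traced by hand or mechanically. The following is a minimal sketch, assuming c = 1, a single sampled letter on each trial, and reinforcement determined by the letter and the response actually made on the preceding trial; the names `TABLE` and `run_direct_scheme`, the labels `s1` and `s2` for the two letters, and the starting state are hypothetical choices for illustration.

```python
# Machine table of the two-state example:
# (previous response, presented letter) -> response the machine requires.
TABLE = {("r1", "s1"): "r1", ("r1", "s2"): "r2",
         ("r2", "s1"): "r2", ("r2", "s2"): "r1"}


def run_direct_scheme(conditioning, previous, letters):
    """Simulate the direct approach with conditioning parameter c = 1.

    conditioning -- dict: letter -> response it is currently conditioned to
    previous     -- response made on the trial before the first one simulated
    letters      -- sequence of presented (and therefore sampled) letters
    Returns a list of (response made, response required) pairs.
    """
    conditioning = dict(conditioning)
    history = []
    for letter in letters:
        response = conditioning[letter]        # the single sampled element decides
        required = TABLE[(previous, letter)]   # what the machine table demands
        history.append((response, required))
        conditioning[letter] = required        # reinforced response, c = 1
        previous = response                    # the response actually made
    return history


history = run_direct_scheme({"s1": "r1", "s2": "r1"}, "r1", ["s2", "s2", "s2"])
for trial, (response, required) in enumerate(history, start=1):
    print(trial, response, required, response == required)
```

With the second letter presented three times in a row, the response on the middle trial agrees with the table and the response on the last trial does not, because the letter alone cannot carry the information about the preceding response.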

What is needed is a quite different definition of the states of the
Markov chain of the stimulus-response model. (For proof of a general
Markov-chain theorem for stimulus-response theory, see Estes and Suppes
(1959).) Naively, it is natural to take as the states of the Markov
chain the possible states of conditioning of the stimuli in S, but this is
wrong on two counts in the present situation. First, we must condition
the patterns of responses and presentation sets, so we take as the set of
stimuli for the model, $R \times S$, i.e., the Cartesian product of the set
R of responses and the set S of stimuli. What the organism must be
conditioned to respond to on trial n is the pattern consisting of the
preceding response given on trial n - 1 and the presentation set
occurring on trial n.



It is still not sufficient to define the states of the Markov chain
in terms of the states of conditioning of the elements in $R \times S$,
because for reasons that are given explicitly and illustrated by many
examples in Estes and Suppes (1959) and Suppes and Atkinson (1960), it is
also necessary to include in the definition of state the response $r_i$
that actually occurred on the preceding trial. The difficulty that arises
if $r_{i,n-1}$ is not included in the definition of state may be brought
out by attempting to draw the tree in the case of the two-state automaton
already considered. Suppose just the pattern $r_1\sigma_1$ is conditioned,
and the other four patterns, $\sigma_0$, $r_1\sigma_2$, $r_2\sigma_1$ and
$r_2\sigma_2$, are not. Let us represent this conditioning state by $C_1$,
and let $\tau_i$ be the noncontingent probability of $\sigma_i$,
$0 \le i \le 2$, on every trial, with every $\tau_i > 0$. Then the tree
looks like this.

    [Figure: incomplete tree rooted at the conditioning state $C_1$, with
    one branch for each presented stimulus and with undetermined branch
    probabilities x, y and z.]

The tree is incomplete, because without knowing what response actually
occurred on trial n - 1 we cannot complete the branches (e.g., specify
the responses), and for a similar reason we cannot determine the
probabilities x, y and z. Moreover, we cannot remedy the situation
by including among the branches the possible responses on trial n - 1,
for to determine their probabilities we would need to look at trial
n - 2, and this regression would not terminate until we reached trial 1.

So we include in the definition of state the response on trial
n - 1. On the other hand, it is not necessary in the case of
deterministic finite automata to permit among the states all possible
conditioning of the patterns in $R \times S$. We shall permit only two
possibilities: the pattern is unconditioned or it is conditioned to the
appropriate response, because conditioning to the wrong response occurs
with probability zero.

Thus with p internal states or responses and m letters in $\Sigma$, there
are (m + 1)p patterns, each of which is in one of two states,
conditioned or unconditioned, and there are p possible preceding
responses, so the number of states in the Markov chain is
$p \cdot 2^{(m+1)p}$. Actually, it is convenient to reduce this number
further by treating $\sigma_0$ as a single pattern regardless of what
preceding response it is paired with. The number of states is then
$p \cdot 2^{mp+1}$. Thus, for the simplest 2-state, 2-alphabet automaton,
the number of states is $2 \cdot 2^5 = 64$.
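
The count just given is easy to check numerically: p possible preceding responses, times two conditioning possibilities for each pattern. The following is a small sketch; the function name and its flag are hypothetical and introduced only for this illustration.

```python
def markov_state_count(p, m, merge_start_stimulus=True):
    """Number of Markov-chain states for p responses and m letters.

    Each pattern is either unconditioned or conditioned (2 possibilities),
    and the preceding response (p possibilities) is part of the state.
    With merge_start_stimulus=True the start-up stimulus counts as a single
    pattern, giving m*p + 1 patterns instead of (m + 1)*p.
    """
    patterns = m * p + 1 if merge_start_stimulus else (m + 1) * p
    return p * 2 ** patterns


print(markov_state_count(2, 2))                              # 64
print(markov_state_count(2, 2, merge_start_stimulus=False))  # 128
```

The count grows exponentially in mp, which is why the reduction obtained by treating the start-up stimulus as a single pattern is worth making.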

We may denote the states by ordered (mp + 2)-tuples

    $\langle r_j, i_0, i_{0,1}, \ldots, i_{0,m}, \ldots, i_{p-1,m} \rangle$

where $i_{k,l}$ is 0 or 1 depending on whether the pattern
$r_k\sigma_l$ is unconditioned or conditioned, with $0 \le k \le p - 1$
and $1 \le l \le m$; $r_j$ is the response on the preceding trial, and
$i_0$ is the state of conditioning of $\sigma_0$. What we want to prove is
that starting in the purely unconditioned set of states
$\langle r_j, 0, 0, \ldots, 0 \rangle$, with probability 1 the system will
always ultimately be in a state that is a member of the set of fully
conditioned states $\langle r_j, 1, 1, \ldots, 1 \rangle$. The proof
of this is the main part of the proof of the basic representation theorem.

Before turning to the theorem we need to define explicitly the
concept of a stimulus-response model's asymptotically becoming an
automaton. As has already been suggested, an important feature of this
definition is this. The basic set S of stimuli corresponding to the
alphabet $\Sigma$ of the automaton is not the basic set of stimuli of the
stimulus-response model, but rather, this basic set is the Cartesian
product $R \times S$, where R is the set of responses. Moreover, the
definition has been framed in such a way as to permit only a single
element of S to be presented and sampled on each trial; this, however,
is an inessential restriction used here in the interest of conceptual
and notational simplicity. Without this restriction the basic set would
be not $R \times S$, but $R \times \mathcal{P}(S)$, where $\mathcal{P}(S)$
is the power set of S, i.e., the set of all subsets of S, and then each
letter of the alphabet $\Sigma$ would be a subset of S rather than a
single element of S. What is essential is to have $R \times S$ rather
than S as the basic set of stimuli to which the axioms of Definition 7
apply.

For example, the pair $(r_i, \sigma_j)$ must be sampled and conditioned
as a pattern, and the axioms are formulated to require that what is
sampled and conditioned be a subset of the presentation set T on a
given trial. In this connection to simplify notation I shall often
write $T_n = (r_{i,n-1}, \sigma_{j,n})$ rather than
$T_n = \{(r_{i,n-1}, \sigma_{j,n})\}$, but the meaning is clear. $T_n$ is
the presentation set consisting of the single pattern (or element) made up
of response $r_i$ on trial n - 1 and stimulus element $\sigma_j$ on trial
n, and from Axiom S1 we know that the pattern is sampled because it is the
only one presented.

From a psychological standpoint something needs to be said about
part of the presentation set being the previous response. In the first
place, and perhaps most importantly, this is not an ad hoc idea adopted
just for the purposes of this paper. It has already been used in a
number of experimental studies unconnected with automata theory. Several
worked-out examples are to be found in various chapters of Suppes and
Atkinson (1960).

Secondly, and more importantly, the use of $R \times S$ is formally
convenient, but is not at all necessary. The classical S-R tradition
of analysis suggests a formally equivalent, but psychologically more
realistic approach. Each response r produces a stimulus $\sigma_r$, or
more generally, a set of stimuli. Assuming again, for formal simplicity,
just one stimulus element $\sigma_r$ rather than a set of stimuli, we may
replace R by the set of stimuli $S_R$, with the purely contingent
presentation schedule

    $P(\sigma_{r,n} \mid r_{n-1}) = 1$,

and in the model we now consider the Cartesian product $S_R \times S$
rather than $R \times S$. Within this framework the important point about
the presentation set on each trial is that one component is purely
subject-controlled and the other purely experimenter-controlled, if we use
familiar experimental distinctions. The explicit use of $S_R$ rather
than R promises to be important in training animals to perform like
automata, because the external introduction of $\sigma_r$ reduces directly
and significantly the memory load on the animal. The importance of
$S_R$ for models of children's language learning is less clear.

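Definition 8. Let $\mathcal{S} = \langle R \times S, R, E, \mu, X, P \rangle$
be a stimulus-response model where

    $R = \{r_0, r_1, \ldots, r_{p-1}\}$,
    $S = \{\sigma_0, \sigma_1, \ldots, \sigma_m\}$,
    $E = \{e_0, e_1, \ldots, e_{p-1}\}$,

and $\mu(S')$ is the cardinality of $S'$ for $S' \subseteq S$. Then
$\mathcal{S}$ asymptotically becomes the automaton
$\mathfrak{A}(\mathcal{S}) = \langle R, S - \{\sigma_0\}, M, r_0, F \rangle$
if and only if

(i) as $n \to \infty$ the probability is 1 that the presentation set
$T_n$ is $(r_{i,n-1}, \sigma_{j,n})$ for some i and j,

(ii) $M(r_i, \sigma_j) = r_k$ if and only if
$\lim_{n \to \infty} P(r_{k,n} \mid T_n = (r_{i,n-1}, \sigma_{j,n})) = 1$
for $0 \le i \le p - 1$ and $1 \le j \le m$,

(iii) $\lim_{n \to \infty} P(r_{0,n} \mid T_n = (r_{i,n-1}, \sigma_{0,n})) = 1$
for $0 \le i \le p - 1$,

(iv) $F \subseteq R$.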



A minor but clarifying point about this definition is that the
stimulus $\sigma_0$ is not part of the alphabet of the automaton
$\mathfrak{A}(\mathcal{S})$, because a stimulus is needed to put the
automaton in the initial state $r_0$, and from the standpoint of the
theory being worked out here, this requires a stimulus to which the
organism will give response $r_0$. That stimulus is $\sigma_0$. The
definition also requires that asymptotically the stimulus-response model
$\mathcal{S}$ is nothing but the automaton $\mathfrak{A}(\mathcal{S})$. It
should be clear that a much weaker and more general definition is
possible. The automaton $\mathfrak{A}(\mathcal{S})$ could merely be
embedded asymptotically in $\mathcal{S}$ and be only a part of the
activities of $\mathcal{S}$. The simplest way to achieve this
generalization is to make the alphabet of the automaton only a proper
subset of $S - \{\sigma_0\}$, and correspondingly for the responses
that make up the internal states of the automaton; they need be only a
proper subset of the full set R of responses. This generalization
will not be pursued here, although something of the sort will clearly
be necessary to give an adequate stimulus-response account of the
semantical aspects of language.

Representation Theorem for Finite Automata. Given any connected
finite automaton, there is a stimulus-response model that asymptotically
becomes isomorphic to it.

Proof: Let $\mathfrak{A} = \langle A, \Sigma, M, s_0, F \rangle$ be any
connected finite automaton. As indicated already, we represent the set A
of internal states by the set R of responses; we shall use the natural
correspondence $s_i \leftrightarrow r_i$ for $0 \le i \le p - 1$, where p
is the number of internal states. We represent the alphabet $\Sigma$ by
the set of stimuli $\sigma_1, \ldots, \sigma_m$ and, for reasons already
made explicit, we augment this set of stimuli by $\sigma_0$ to obtain

    $S = \{\sigma_0, \sigma_1, \ldots, \sigma_m\}$.

For subsequent reference let f be the function defined on $A \cup \Sigma$
that establishes the natural one-one correspondence between A and R,
and between $\Sigma$ and $S - \{\sigma_0\}$. (To avoid some trivial
technical points I shall assume that A and $\Sigma$ are disjoint.)

We take as the set of reinforcements

    $E = \{e_0, e_1, \ldots, e_{p-1}\}$,

and the measure $\mu(S')$ is the cardinality of $S'$ for $S' \subseteq S$,
so that as in Definition 8, we are considering a stimulus-response model
$\mathcal{S} = \langle R \times S, R, E, \mu, X, P \rangle$. In order to
show that $\mathcal{S}$ asymptotically becomes an automaton, we impose
five additional restrictions on $\mathcal{S}$. They are these.

First, in the case of reinforcement $e_0$ the schedule is this:

    $P(e_{0,n} \mid \sigma_{0,n}) = 1$,                                  (1)

i.e., if $\sigma_0$ is part of the presentation set on trial n, then with
probability 1 response $r_0$ is reinforced; note that the reinforcing
event $e_{0,n}$ is independent of the actual occurrence of the event
$r_{0,n}$.

Second, the remaining reinforcement schedule is defined by the
transition table M of the automaton $\mathfrak{A}$. Explicitly, for
$j, k \neq 0$ and for all i and n

    $P(e_{k,n} \mid \sigma_{j,n} r_{i,n-1}) = 1$ if and only if
    $M(f^{-1}(r_i), f^{-1}(\sigma_j)) = f^{-1}(r_k)$.                    (2)

Third, essential to the proof is the additional assumption beyond
(1) and (2) that the stimuli $\sigma_0, \ldots, \sigma_m$ each have a
positive, noncontingent probability of occurrence on each trial (a model
with a weaker assumption could be constructed but it is not significant to
weaken this requirement). Explicitly, we then assume that for any
cylinder set $Y(C_n)$ such that $P(Y(C_n)) > 0$

    $P(\sigma_{i,n}) = P(\sigma_{i,n} \mid Y(C_n)) \ge \tau_i > 0$       (3)

for $0 \le i \le m$ and for all trials n.

Fourth, we assume that the probability $\rho_i$ of response $r_i$
occurring when no conditioned stimulus is sampled is also strictly
positive, i.e., for every response $r_i$

    $\rho_i > 0$,                                                        (4)

which strengthens Axiom R2.

Fifth, for each integer k, $0 \le k \le mp + 1$, we define the set
$Q_k$ as the set of states that have exactly k patterns conditioned, and
$Q_{k,n}$ is the event of being in a state that is a member of $Q_k$ on
trial n. We assume that at the beginning of trial 1, no patterns are
conditioned, i.e.,

    $P(Q_{0,1}) = 1$.                                                    (5)



It is easy to prove that given the sets R, S, E and the
cardinality measure $\mu$, there are many different stimulus-response
models satisfying restrictions (1)-(5), but for the proof of the
theorem it is not necessary to select some distinguished member of the
class of models because the argument that follows shows that all the
members of the class asymptotically become isomorphic to $\mathfrak{A}$.

The main thing we want to prove is that as $n \to \infty$

    $P(Q_{mp+1,n}) \to 1$.                                               (6)

We first note that if j < k the probability of a transition from
$Q_k$ to $Q_j$ is zero, i.e.,

    $P(Q_{j,n+1} \mid Q_{k,n}) = 0$;                                     (7)

moreover,

    $P(Q_{j,n+1} \mid Q_{k,n}) = 0$                                      (8)

even if j > k unless j = k + 1. In other words, in a single trial,
at most one pattern can become conditioned.

To show that asymptotically (6) holds, it will suffice to show
that there is an $\epsilon > 0$ such that on each trial n for
$0 \le k \le mp < n$, if $P(Q_{k,n}) > 0$,

    $P(Q_{k+1,n+1} \mid Q_{k,n}) \ge \epsilon$.                          (9)

To establish (9) we need to show that there is a probability of at least
$\epsilon$ of a stimulus pattern that is unconditioned at the beginning of
trial n becoming conditioned on that trial. The argument given will
be a uniform one that holds for any unconditioned pattern. Let
$r^*\sigma^*$ be such a pattern on trial n.

Now it is well known that for a connected automaton, for every
internal state s, there is a tape x such that

    $M(s_0, x) = s$                                                      (10)

and the length of x is not greater than the number of internal states.
In terms of stimulus-response theory, x is a finite sequence of
length not greater than p of stimulus elements. Thus we may take
$x = \sigma_{i_1}, \ldots, \sigma_{i_p}$ with $\sigma_{i_p} = \sigma^*$.
We know by virtue of (3) that

    $\min_{0 \le i \le m} \tau_i = \tau > 0$.                            (11)

The required sequence of responses $r_{i_1}, \ldots, r_{i_{p-1}}$ will
occur either from prior conditioning or, if any response is not
conditioned to the appropriate pattern, with guessing probability
$\rho_i$. By virtue of (4)

    $\min_{0 \le i \le p-1} \rho_i = \rho > 0$.                          (12)



To show that the pattern $r^*\sigma^*$ has a positive probability
$\epsilon$ of being conditioned on trial n, we need only take n large
enough for the tape x to be "run," say, n > p + 1, and consider the joint
probability

    $P^* = P(\sigma^*_n, r^*_{n-1}, \sigma_{i_{p-1},n-1}, r_{i_{p-2},n-2},
             \ldots, r_{0,n-i_p-1}, \sigma_{0,n-i_p-1})$.                (13)

The basic axioms and the assumptions (1)-(5) determine a lower bound on
$P^*$ independent of n. First we note that for each of the stimulus
elements $\sigma_0, \sigma_{i_1}, \ldots, \sigma^*$, by virtue of (3) and
(11)

    $P(\sigma^*_n \mid \ldots) \ge \tau, \ \ldots, \
     P(\sigma_{0,n-i_p-1}) \ge \tau$.

Similarly, from (4) and (12), as well as the response axioms, we know
that for each of the responses $r_0, r_{i_1}, \ldots, r^*$

    $P(r^*_{n-1} \mid \ldots) \ge \rho, \ \ldots, \
     P(r_{0,n-i_p-1} \mid \sigma_{0,n-i_p-1}) \ge \rho$.

Thus we know that

    $P^* \ge \rho^p \tau^{p+1}$,

and given the occurrence of the event $\sigma^*_n r^*_{n-1}$, the
probability of conditioning is c, whence we may take

    $\epsilon = c \rho^p \tau^{p+1} > 0$,

which establishes (9) and completes the proof.
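
The convergence asserted by the theorem can be watched in a small simulation. The following is a minimal sketch under several simplifying assumptions that are not part of the proof: each stimulus, including the start-up stimulus (coded as letter 0), is presented with equal probability on every trial, guessing is uniform over the responses, and the start-up stimulus is treated as a single pattern. The function name, the default parameter values, and the example table are hypothetical.

```python
import random


def simulate_conditioning(table, p, m, c=0.5, trials=5000, seed=0):
    """Simulate pattern conditioning for a deterministic automaton.

    table -- dict: (state i, letter j) -> next state, 0 <= i < p, 1 <= j <= m
    Letter 0 stands for the start-up stimulus, whose reinforced response is
    state 0. Returns a dict mapping each conditioned pattern to its response.
    """
    rng = random.Random(seed)
    conditioned = {}                 # absent key = unconditioned pattern
    previous = rng.randrange(p)      # arbitrary response before trial 1
    for _ in range(trials):
        letter = rng.randrange(m + 1)            # noncontingent presentation
        pattern = "start" if letter == 0 else (previous, letter)
        if pattern in conditioned:
            response = conditioned[pattern]      # conditioned response
        else:
            response = rng.randrange(p)          # guessing, probability 1/p
        # Reinforcement is fixed by the pattern, not by the response made.
        reinforced = 0 if letter == 0 else table[(previous, letter)]
        if pattern not in conditioned and rng.random() < c:
            conditioned[pattern] = reinforced    # conditioning with prob. c
        previous = response
    return conditioned


# Two-state, two-letter example; states 0 and 1 play the roles of r_1, r_2.
PARITY = {(0, 1): 0, (0, 2): 1, (1, 1): 1, (1, 2): 0}
learned = simulate_conditioning(PARITY, p=2, m=2)
target = {"start": 0, **PARITY}
print(learned == target)
```

Because reinforcement depends only on the presented pattern, a pattern can never be conditioned to the wrong response in this sketch; the only question is how long it takes for every pattern to be presented and conditioned, which is what the bound $\epsilon$ controls in the proof.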

Given the theorem just proved there are several significant corollaries
whose proofs are almost immediate. The first combines the representation
theorem for regular languages with that for finite automata to yield:

Corollary on Regular Languages. Any regular language is generated by
some stimulus-response model at asymptote.

Once probabilistic considerations are made a fundamental part of the
scene, we can in several different ways go beyond the restriction of
stimulus-response generated languages to regular languages, but I shall
not explore these matters here.

I suspect that many psychologists or philosophers who are willing
to accept the sense given here to the reduction of finite automata and
regular languages to stimulus-response models will be less happy with
the claim that one well-defined sense of the concepts of intention,
plan and purpose can be similarly reduced. However, without any substantial
new analysis on my part this can be done by taking advantage of an analysis
already made by Miller and Chomsky (1963). The story goes like this.

In 1960 Miller, Galanter and Pribram published a provocative book
entitled Plans and the Structure of Behavior. In this book they severely
criticized stimulus-response theories for being able to account for so
little of the significant behavior of men and the higher animals. They
especially objected to the conditioned reflex as a suitable concept
for building up an adequate scientific psychology. It is my impression
that a number of cognitively oriented psychologists have felt that the
critique of S-R theory in this book is devastating.

As I indicated in the introductory section, I would agree that 
conditioned reflex experiments are indeed far too simple to form an 
adequate scientific basis for analyzing more complex behavior. This 
is as hopeless as would be the attempt to derive the theory of 
differential equations, let us say, from the elementary algebra of sets. 

Yet the more general theory of sets does encompass in a strict mathematical 
sense the theory of differential equations. 

The same relation may be shown to hold between stimulus-response
theory and the theory of plans, insofar as the latter theory has been
systematically formulated by Miller and Chomsky. The theory of plans is
formulated in terms of tote units ("tote" is an acronym for the cycle
test-operate-test-exit). A plan is then defined as a tote hierarchy,
which is just a form of oriented graph, and every finite oriented
graph may be represented as a finite automaton. So we have the result:

Corollary. Any tote hierarchy in the sense of Miller and Chomsky
is isomorphic to some stimulus-response model at asymptote.

5. Representation of Probabilistic Automata 

From the standpoint of the kind of learning models and experiments
characteristic of the general area of what has come to be termed
probability learning, there is near at hand a straightforward approach to
probabilistic automata. It is worth illustrating this approach, but it
is perhaps even more desirable to discuss it with some explicitness in
order to show why it is not fully satisfactory, indeed for most
purposes considerably less satisfactory than a less direct approach that
follows from the representation of deterministic finite automata already
discussed.

The direct approach is dominated by two features: a probabilistic
reinforcement schedule and the conditioning of input stimuli rather than
response-stimulus patterns. The main simplification that results from
these features is that the number of states of conditioning and
consequently the number of states in the associated Markov chain is
reduced. A two-letter, two-state probabilistic automaton, for example,
requires 36 states in the associated Markov chain, rather than 64, as in
the deterministic case. We have, as before, the three stimuli $\sigma_0$,
$\sigma_1$ and $\sigma_2$ and their conditioning possibilities, 2 for
$\sigma_0$ as before, but now, in the probabilistic case, 3 for $\sigma_1$
and $\sigma_2$, and we also need as part of the state, not for purposes of
conditioning, but in order to make the response-contingent reinforcement
definite, the previous response, which is always either $r_1$ or $r_2$.
Thus we have $2 \cdot 3 \cdot 3 \cdot 2 = 36$. Construction of the trees
to compute transition probabilities for the Markov chain follows closely
the logic outlined in the previous section. We may define the
probabilistic reinforcement schedule by two equations, the first of which
is deterministic and plays exactly the same role as previously:

    $P(e_{1,n} \mid \sigma_{0,n}) = 1$

and

    $P(e_{1,n} \mid \sigma_{i,n} r_{j,n-1}) = \pi_{ij}$

for $1 \le i, j \le 2$.

The fundamental weakness of this setup is that the asymptotic
transition table representing the probabilistic automaton only holds
in the mean. Even at asymptote the transition values fluctuate from
trial to trial depending upon the actual previous reinforcement, not
the probabilities $\pi_{ij}$. Moreover, the transition table is no longer
the transition table of a Markov process. Knowledge of earlier
responses and reinforcements will lead to a different transition table,
whenever the number of stimuli representing a letter of the input
alphabet is greater than one. These matters are well known in the
large theoretical literature of probability learning and will not be
developed further here.

For most purposes of application it seems natural to think of
probabilistic automata as a generalization of deterministic automata
intended to handle the problem of errors. A similar consideration of
errors after a concept or skill has been learned is common in learning
theory. Here is a simple example. In the standard version of all-or-none
learning, the organism is in either the unconditioned state (U) or
the conditioned state (C). The transition matrix for these states is

             C       U
        C    1       0
        U    c     1 - c

and, consistent with the axioms of Section 3,

    P(Correct response | C) = 1                                          (1)

    P(Correct response | U) = p                                          (2)

    $P(U_1) = 1$,

i.e., the probability of being in state U on trial 1 is 1. Now by
changing (1) to

    P(Correct response | C) = $1 - \epsilon$,

for $\epsilon > 0$, we get a model that predicts errors after conditioning
has occurred.
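
The effect of the modification on the learning curve can be computed directly. The following is a minimal sketch, assuming the transition from U to C takes place between trials, so that the probability of still being in U on trial n is $(1 - c)^{n-1}$; the function name and the numerical values are hypothetical.

```python
def p_correct(n, c, p, eps=0.0):
    """Probability of a correct response on trial n in the all-or-none model.

    c   -- probability of moving from state U to state C between trials
    p   -- probability of a correct response (a guess) in state U
    eps -- probability of an error in state C; eps = 0 is the standard model
    """
    p_unconditioned = (1.0 - c) ** (n - 1)       # P(U_n), given P(U_1) = 1
    p_conditioned = 1.0 - p_unconditioned        # P(C_n)
    return p_conditioned * (1.0 - eps) + p_unconditioned * p


for n in (1, 5, 20, 1000):
    print(n, p_correct(n, c=0.2, p=0.25), p_correct(n, c=0.2, p=0.25, eps=0.1))
```

In the standard model the curve rises from p to 1; with a positive error probability it levels off at $1 - \epsilon$ instead, which is the sense in which the modified model predicts errors after conditioning.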

Without changing the axioms of Section 3 we can incorporate such
probabilistic-error considerations into the derivation of a representation
theorem for probabilistic automata. One straightforward procedure is
to postulate that the pattern sampled on each trial actually consists of
N elements, and that in addition M background stimuli common to all
trials are sampled, or available for sampling. By specializing further
the sampling axioms S1-S4 and by adjusting the parameters M and
N, we can obtain any desired probability $\epsilon$ of an error.

Because it seems desirable to develop the formal results with intended
application to detailed learning data, I shall not state and prove a
representation theorem for probabilistic automata here, but restrict
myself to considering one example of applying a probabilistic automaton
model to asymptotic performance data. The formal machinery for analyzing
learning data will be developed in a subsequent paper.

The example I consider is drawn from arithmetic. For more than
three years we have been collecting extensive data on the arithmetic
performance of elementary-school students, in the context of various
projects on computer-assisted instruction in elementary mathematics.
Prior to consideration of automaton models, the main tools of analysis
have been linear regression models. The dependent variables in these
models have been the mean probability of a correct response to an item
and the mean success latency. The independent variables have been
structural features of items, i.e., arithmetic problems, that may be
objectively identified independently of any analysis of response data.
Detailed results for such models are to be found in Suppes, Hyman
and Jerman (1967) and Suppes, Jerman and Brian (1968). The main
conceptual weakness of the regression models is that they do not provide
an explicit temporal analysis of the steps being taken by a student in
solving a problem. They can identify the main variables but not connect
these variables in a dynamically meaningful way. In contrast, analysis
of the temporal process of problem solution is a natural and integral
part of an automaton model.

An example that is typical of the skills and concepts encountered 
in arithmetic is column addition of two integers. For simplicity I 
shall consider only problems for which the two given numbers and their 
sum all have the same number of digits. It will be useful to begin by 
defining a deterministic automaton that will perform the desired addition 
by outputting one digit at a time reading from right to left, just as 
the students are required to do at computer-based teletype terminals. 

For this purpose it is convenient to modify in inessential ways the 
earlier definition of an automaton. An automaton will now be defined 
as a structure $\mathfrak{A} = \langle A, \Sigma_I, \Sigma_O, M, Q, s_0 \rangle$, where $A$, $\Sigma_I$, and $\Sigma_O$ 
are non-empty finite sets, with $A$ being the set of internal states 
as before, $\Sigma_I$ the input alphabet, and $\Sigma_O$ the output alphabet. 
Also as before, $M$ is the transition function mapping $A \times \Sigma_I$ into 
$A$, and $s_0$ is the initial state. The function $Q$ is the output 
function mapping $A \times \Sigma_I$ into $\Sigma_O$. 

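For column addition of two integers in standard base ten 
representation, an appropriate automaton is the following: 

\[
\begin{aligned}
A &= \{0, 1\}, \\
\Sigma_I &= \{(m,n) : 0 \le m, n \le 9\}, \\
\Sigma_O &= \{0, 1, \ldots, 9\}, \\
M(k,(m,n)) &= \begin{cases} 0 & \text{if } m + n + k \le 9, \\ 1 & \text{if } m + n + k > 9, \end{cases} \quad \text{for } k = 0, 1, \\
Q(k,(m,n)) &= (k + m + n) \bmod 10, \\
s_0 &= 0 .
\end{aligned}
\]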
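The definition above can be sketched directly in code. The following is a minimal illustration, not part of the original analysis: the function name `column_add` and the digit-string handling are assumptions introduced here, while `M` and `Q` follow the transition and output functions just given.

```python
from typing import List, Tuple


def M(k: int, pair: Tuple[int, int]) -> int:
    """Transition function: the next internal state, i.e., the carry."""
    m, n = pair
    return 0 if m + n + k <= 9 else 1


def Q(k: int, pair: Tuple[int, int]) -> int:
    """Output function: the digit written for the current column."""
    m, n = pair
    return (k + m + n) % 10


def column_add(x: int, y: int) -> List[int]:
    """Run the automaton on x + y and return the output digits, ones' column first.

    Hypothetical helper: it enforces the restriction that both numbers
    and their sum have the same number of digits.
    """
    xs, ys = str(x)[::-1], str(y)[::-1]
    if len(xs) != len(ys) or len(str(x + y)) != len(xs):
        raise ValueError("operands and sum must have the same number of digits")
    state = 0  # initial state: no carry at the start of the problem
    out = []
    for a, b in zip(xs, ys):
        pair = (int(a), int(b))
        out.append(Q(state, pair))  # the output depends on the current state
        state = M(state, pair)      # then the carry is stored as the new state
    return out


print(column_add(47, 15))  # digits of 62, ones' column first
```

Reading the digits right to left mirrors the order in which students enter their responses at the terminal.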

Thus the automaton operates by adding first the ones' column, storing 
as internal state 0 if there is no carry, 1 if there is a carry, 
outputting the sum of the ones' column modulo 10, and then moving on 
to the input of the two tens' column digits, etc. The initial internal 
state $s_0$ is 0 because at the beginning of the problem there is no 
"carry." 

For the analysis of student data it is necessary to move from a 
deterministic to a probabilistic automaton. The number of possible 
parameters that can be introduced is uninterestingly large. Each 
transition $M(k,(m,n))$ may be replaced by a probabilistic transition 
$1 - \varepsilon_{k,m,n}$ and $\varepsilon_{k,m,n}$, and each output $Q(k,(m,n))$ by ten probabilities, 
for a total of 2200 parameters. Using the sort of linear regression model 
described above we have found that a fairly good account of student 
performance data can be obtained by considering two structural variables, 
$C_i$, the number of carries in problem item $i$, and $D_i$, the number of digits 
or columns. Let $p_i$ be the mean probability of a correct response on 
item $i$ and let 

\[ z_i = \log \frac{1 - p_i}{p_i} . \]

The regression model is then characterized by the equation 

\[ z_i = \alpha_0 + \alpha_1 C_i + \alpha_2 D_i , \tag{1} \]

and the coefficients $\alpha_0$, $\alpha_1$, and $\alpha_2$ are estimated from the data. 
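As a rough illustration of the regression model, the sketch below applies the log transform to each item and estimates the three coefficients by ordinary least squares. Everything named here is an assumption for illustration: the helper names, the choice of the normal equations as the fitting method, and above all the item list and coefficient values, which are made up rather than taken from the student data.

```python
import math
from typing import List, Sequence


def logit_error(p: float) -> float:
    """z = log((1 - p) / p), the transformed dependent variable."""
    return math.log((1.0 - p) / p)


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_regression(carries: Sequence[int], digits: Sequence[int],
                   p_correct: Sequence[float]) -> List[float]:
    """Least-squares estimates [alpha0, alpha1, alpha2] via the normal equations."""
    rows = [[1.0, float(c), float(d)] for c, d in zip(carries, digits)]
    z = [logit_error(p) for p in p_correct]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtz = [sum(r[i] * zi for r, zi in zip(rows, z)) for i in range(3)]
    return solve_linear(xtx, xtz)


# Hypothetical items (C_i, D_i); p_i is generated from made-up coefficients,
# so the fit should simply recover those coefficients.
TRUE_ALPHA = (-3.0, 0.8, 0.3)
ITEMS = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
P = [1.0 / (1.0 + math.exp(TRUE_ALPHA[0] + TRUE_ALPHA[1] * c + TRUE_ALPHA[2] * d))
     for c, d in ITEMS]
ALPHA_HAT = fit_regression([c for c, _ in ITEMS], [d for _, d in ITEMS], P)
```

Because the synthetic probabilities are generated without noise, the recovered coefficients only confirm that the transform and the fit are consistent with each other; they say nothing about real performance data.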

A similar three-parameter automaton model is structurally very 
natural. First, two parameters, $\varepsilon$ and $\eta$, are introduced according to 
whether there is a "carry" to the next column: 

\[ P(M(k,(m,n)) = 0 \mid k + m + n \le 9) = 1 - \varepsilon \]

and 

\[ P(M(k,(m,n)) = 1 \mid k + m + n > 9) = 1 - \eta . \]

In other words, if there is no "carry," the probability of a correct 
transition is $1 - \varepsilon$ and if there is a "carry" the probability of such 
a transition is $1 - \eta$. The third parameter, $\gamma$, is simply the probability 
of an output error. Conversely, the probability of a correct output is: 

\[ P(Q(k,(m,n)) = (k + m + n) \bmod 10) = 1 - \gamma . \]

Consider now problem $i$ with $C_i$ carries and $D_i$ digits. If we 
ignore the probability of two errors leading to a correct response, e.g., 
a transition error followed by an output error, then the probability of a 
correct answer is just 

\[ P(\text{Correct Answer to Problem } i) = (1-\gamma)^{D_i} (1-\eta)^{C_i} (1-\varepsilon)^{D_i - C_i - 1} . \tag{2} \]

As already indicated it is important to realize that this equation is an 
approximation of the "true" probability. However, to compute the exact 
probability it is necessary to make a definite assumption about how the 
probability $\gamma$ of an output error is distributed among the 9 possible 
wrong responses. A simple and intuitively appealing one-parameter model 
is the one that arranges the 10 digits on a circle in natural order with 
9 next to 0, and then makes the probability of an error $j$ steps to 
the right or left of the correct response $\delta^j$. For example, if 5 is 
the correct digit, then the probability of responding 4 is $\delta$, of 3 
is $\delta^2$, of 2 is $\delta^3$, of 1 is $\delta^4$, of 0 is $\delta^5$, of 6 is $\delta$, of 
7 is $\delta^2$, etc. Thus in terms of the original model 

\[ \gamma = 2(\delta + \delta^2 + \delta^3 + \delta^4) + \delta^5 . \]

Consider now the problem 

\[ \begin{array}{r} 47 \\ +\,15 \end{array} \]

Then, where $d_i$ = the $i$th digit response, 

\[ P(d_1 = 2) = (1 - \gamma) \]
\[ P(d_2 = 6) = (1-\gamma)(1-\eta) + \eta\delta . \]

Here the additional term is because if the state entered is 0 rather 
than 1 when the pair (7,5) is input, the only way of obtaining a correct 
answer is for 6 to be given as the sum of 0+4+1, which has a 
probability $\delta$. Thus the probability of a correct response to this 
problem is $(1-\gamma)[(1-\gamma)(1-\eta) + \eta\delta]$. Hereafter we shall ignore the 
$\eta\delta$ (or $\varepsilon\delta$) terms. 

Returning to equation (2) we may get a direct comparison with the 
linear regression model defined by equation (1), if we take the logarithm 
of both sides to obtain: 

\[ \log p_i = D_i \log(1-\gamma) + C_i \log(1-\eta) + (D_i - C_i - 1) \log(1-\varepsilon) , \tag{3} \]

and estimate $\log(1-\gamma)$, $\log(1-\eta)$, and $\log(1-\varepsilon)$ by regression with the 
additive constant set equal to zero. We also may use some other approach 
to estimation such as minimum $\chi^2$ or maximum likelihood. An analytic 
solution of the standard maximum-likelihood equations is very messy 
indeed, but the maximum of the likelihood function can be found 
numerically. 
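The relation between the approximate probability of equation (2), its logarithmic form (3), and the exact probability under the circular error model can be checked numerically. The sketch below is an illustration under stated assumptions: all function names are hypothetical, the parameter values are made up, and the exact computation assumes that an output error is measured by circular distance from the digit the automaton would produce in its current (possibly wrong) state.

```python
import math
from typing import Sequence, Tuple


def gamma_from_delta(delta: float) -> float:
    """Total output-error probability under the one-parameter circular model."""
    return 2 * (delta + delta**2 + delta**3 + delta**4) + delta**5


def circular_distance(a: int, b: int) -> int:
    """Number of steps between two digits arranged on a circle of ten."""
    d = abs(a - b) % 10
    return min(d, 10 - d)


def p_correct_approx(C: int, D: int, gamma: float, eps: float, eta: float) -> float:
    """Equation (2): every output and every transition is correct."""
    return (1 - gamma) ** D * (1 - eta) ** C * (1 - eps) ** (D - C - 1)


def log_p_correct_approx(C: int, D: int, gamma: float, eps: float, eta: float) -> float:
    """Equation (3): the same quantity in its log-linear form."""
    return (D * math.log(1 - gamma) + C * math.log(1 - eta)
            + (D - C - 1) * math.log(1 - eps))


def p_correct_exact(columns: Sequence[Tuple[int, int]],
                    delta: float, eps: float, eta: float) -> float:
    """Exact probability that every digit is right; columns are ones' column first."""
    gamma = gamma_from_delta(delta)
    # weights[k]: probability of being in state k with all digits so far correct
    weights = {0: 1.0, 1: 0.0}
    true_state = 0
    for m, n in columns:
        target = (true_state + m + n) % 10
        new = {0: 0.0, 1: 0.0}
        for k, w in weights.items():
            produced = (k + m + n) % 10
            dist = circular_distance(produced, target)
            p_out = (1 - gamma) if dist == 0 else delta ** dist
            carry = 1 if k + m + n > 9 else 0
            p_ok = (1 - eta) if carry else (1 - eps)
            new[carry] += w * p_out * p_ok
            new[1 - carry] += w * p_out * (1 - p_ok)
        weights = new
        true_state = 1 if true_state + m + n > 9 else 0
    return weights[0] + weights[1]


# Made-up parameter values, used only to compare the two formulas on 47 + 15.
DELTA, EPS, ETA = 0.01, 0.02, 0.05
GAMMA = gamma_from_delta(DELTA)
EXACT = p_correct_exact([(7, 5), (4, 1)], DELTA, EPS, ETA)
APPROX = p_correct_approx(1, 2, GAMMA, EPS, ETA)
```

For this two-digit problem the difference between `EXACT` and `APPROX` is the single cross term in which a transition error is offset by an output error one step away, which is the kind of term the text proposes to ignore.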

The automaton model naturally suggests a more detailed analysis of 
the data. Unlike the regression model, the automaton provides an 
immediate analysis of the digit-by-digit responses. Ignoring the 
$\varepsilon\delta$-type terms, we can in fact find the general maximum-likelihood 
estimates of $\gamma$, $\varepsilon$, and $\eta$ when the response data are given in this 
more explicit form. 

Let there be $n$ digit responses in a block of problems. For 
$1 \le i \le n$ let $x_i$ be the random variable that assumes the value 1 if 
the $i$th response is correct and 0 otherwise. It is then easy to see that 

\[
P(x_i = 1) = \begin{cases}
(1-\gamma) & \text{if } i \text{ is a ones' column digit,} \\
(1-\gamma)(1-\varepsilon) & \text{if it is not a ones' column and there is no carry to the } i\text{th digit,} \\
(1-\gamma)(1-\eta) & \text{if there is a carry to the } i\text{th digit,}
\end{cases}
\]

granted that $\varepsilon\delta$-type terms are ignored. Similarly for the same three 
alternatives 

\[
P(x_i = 0) = \begin{cases}
\gamma \\
1 - (1-\gamma)(1-\varepsilon) \\
1 - (1-\gamma)(1-\eta) .
\end{cases}
\]
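The three cases above translate into a small lookup for the probability of a correct digit, from which the log-likelihood of a string of responses follows as a sum. This is a sketch only: the labels `ONES`, `NO_CARRY`, and `CARRY`, the function names, and the parameter values are assumptions introduced for illustration.

```python
import math
from typing import Iterable, Tuple

ONES, NO_CARRY, CARRY = "ones", "no_carry", "carry"


def p_digit_correct(kind: str, gamma: float, eps: float, eta: float) -> float:
    """Probability that a single digit response is correct, by column type."""
    if kind == ONES:
        return 1 - gamma
    if kind == NO_CARRY:
        return (1 - gamma) * (1 - eps)
    if kind == CARRY:
        return (1 - gamma) * (1 - eta)
    raise ValueError(f"unknown column type: {kind}")


def log_likelihood(responses: Iterable[Tuple[str, int]],
                   gamma: float, eps: float, eta: float) -> float:
    """Sum of log P(x_i) over (column type, x_i) pairs, with x_i in {0, 1}."""
    total = 0.0
    for kind, x in responses:
        p = p_digit_correct(kind, gamma, eps, eta)
        total += math.log(p if x == 1 else 1 - p)
    return total


# Made-up responses and parameter values, for illustration only.
SAMPLE = [(ONES, 1), (CARRY, 1), (ONES, 0), (NO_CARRY, 0)]
LL = log_likelihood(SAMPLE, gamma=0.05, eps=0.02, eta=0.10)
```

Grouping the terms of this sum by column type and by outcome gives a product of six powers, one for each combination that actually contributes.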



So for a string of actual digit responses $x_1, \ldots, x_n$ we can write 
the likelihood function as: 

\[ L(x_1, \ldots, x_n) = (1-\gamma)^a \gamma^b (1-\varepsilon)^c (1-\eta)^d [1 - (1-\gamma)(1-\varepsilon)]^e [1 - (1-\gamma)(1-\eta)]^f , \tag{4} \]

where $a$ = number of correct responses, $b$ = number of incorrect responses 
in the ones' column, $c$ = number of correct responses not in the ones' 
column when the internal state is 0, $d$ = number of correct responses 
when the internal state is 1, $e$ = number of incorrect responses not 
in the ones' column when the internal state is 0, and $f$ = number of 
incorrect responses when the internal state is 1. It is more convenient 
to estimate $\gamma' = 1 - \gamma$, $\varepsilon' = 1 - \varepsilon$, and $\eta' = 1 - \eta$. Making this change, 
taking the log of both sides of (4) and differentiating with respect to 
each of the variables, we obtain three equations that determine the 
maximum-likelihood estimates of $\gamma'$, $\varepsilon'$, and $\eta'$: 

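\[ \frac{a}{\gamma'} - \frac{b}{1-\gamma'} - \frac{e\varepsilon'}{1-\gamma'\varepsilon'} - \frac{f\eta'}{1-\gamma'\eta'} = 0 , \]

\[ \frac{c}{\varepsilon'} - \frac{e\gamma'}{1-\gamma'\varepsilon'} = 0 , \]

\[ \frac{d}{\eta'} - \frac{f\gamma'}{1-\gamma'\eta'} = 0 . \]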



Solving these equations, we obtain as estimates: 

\[ \hat{\gamma}' = \frac{a - c - d}{a + b - c - d} , \]

\[ \hat{\varepsilon}' = \frac{c(a + b - c - d)}{(c + e)(a - c - d)} , \]

\[ \hat{\eta}' = \frac{d(a + b - c - d)}{(d + f)(a - c - d)} . \]

The most interesting feature of these estimates is that $\hat{\gamma}'$ is the 
ratio of correct responses to total responses in the ones' column. 
The two equations that yield estimates of $\varepsilon'$ and $\eta'$ are especially 
transparent if they are rewritten: 

\[ (1-\gamma)(1-\varepsilon) = \gamma'\varepsilon' = \frac{c}{c+e} , \]

\[ (1-\gamma)(1-\eta) = \gamma'\eta' = \frac{d}{d+f} . \]
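The closed-form estimates can be checked against the likelihood equations by substituting them back and confirming that the three partial derivatives vanish. The sketch below does this for one set of counts; the counts are hypothetical and the names `Counts`, `mle`, and `score` are assumptions introduced for illustration.

```python
from typing import NamedTuple, Tuple


class Counts(NamedTuple):
    a: int  # correct responses, all columns
    b: int  # incorrect responses in the ones' column
    c: int  # correct responses, not ones' column, internal state 0
    d: int  # correct responses, internal state 1
    e: int  # incorrect responses, not ones' column, internal state 0
    f: int  # incorrect responses, internal state 1


def mle(n: Counts) -> Tuple[float, float, float]:
    """Closed-form estimates of gamma', eps', and eta'."""
    ones_correct = n.a - n.c - n.d
    g1 = ones_correct / (ones_correct + n.b)
    e1 = n.c / ((n.c + n.e) * g1)
    h1 = n.d / ((n.d + n.f) * g1)
    return g1, e1, h1


def score(n: Counts, g1: float, e1: float, h1: float) -> Tuple[float, float, float]:
    """Partial derivatives of the log-likelihood in gamma', eps', and eta'."""
    dg = (n.a / g1 - n.b / (1 - g1)
          - n.e * e1 / (1 - g1 * e1) - n.f * h1 / (1 - g1 * h1))
    de = n.c / e1 - n.e * g1 / (1 - g1 * e1)
    dh = n.d / h1 - n.f * g1 / (1 - g1 * h1)
    return dg, de, dh


# Hypothetical counts: 950 of 1000 ones' column digits correct,
# 540 of 600 correct in state 0, 340 of 400 correct in state 1.
N = Counts(a=1830, b=50, c=540, d=340, e=60, f=60)
G1, E1, H1 = mle(N)
RESIDUALS = score(N, G1, E1, H1)
```

Nothing in these formulas forces the estimates of $\varepsilon'$ and $\eta'$ to be at most 1, so with unusual counts a constrained numerical maximization may be the safer choice.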

Additional analysis of this example will not be pursued here. I do 
want to note that the internal states 0 and 1 are easily externalized 
as oral responses and most teachers do indeed require such 
externalization at the beginning. 

To many readers the sort of probabilistic automaton just analyzed 
will seem far removed from the sort of device required to account for 
language behavior. Certainly the automata that are adequate to 
analyze arithmetic are simpler in structure than what is needed even 
for the language output of a two-year-old child. What I wanted to 
illustrate is how we can bear down on detailed analysis of data with 
specific automaton models. Hopefully, we shall in the future be able 
to move to an equally detailed analysis of child language behavior, 
but I have no illusions about the difficulty of the problems yet to be 
solved. 
Footnotes 

1 
Kenneth N. Wexler for a number of helpful comments on the manuscript. 

2 
Indeed, Phoebe C. E. Diebold and Ebbey Bruce Ebbesen have already 
successfully trained two pigeons. Sequences of several hundred responses 
are easily obtained, and the error rate is surprisingly low, well under 
one per cent. We are now tackling the training schedule of a three-state 
automaton. 

3 
Exactly such a procedure has proved very successful in the pigeon 
experiments with Diebold and Ebbesen mentioned in footnote 2. 

4 
Two other points about the definition are the following. First, 
in this definition and throughout the rest of the article $e_0$ is the 
reinforcement of response $r_0$, and not the null reinforcement, as in 
Section 3. This conflict arises from the different notational conventions 
in mathematical learning theory and automata theory. Second, strictly 
speaking I should write OC^{\J ) because F is not uniquely determined 

5 
After this was written Gordon Bower brought to my attention the 
article by Millenson (1967) that develops this point informally. 